-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
BUG: Fixed DataFrame.combine with non-unique columns #62760
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
BUG: Fixed DataFrame.combine with non-unique columns #62760
Conversation
|
I realized that I need to further think about the logic of my proposed fix. The .iloc usage is not capturing the intended columns to merge ...
|
…on-unique-columns
…on-unique-columns
pandas/core/frame.py
Outdated
| new_columns_out = self.columns.union(other_columns, sort=False) | ||
| # Deduplicate column names if necessary | ||
| self_columns = Index( | ||
| dedup_names(list(self_columns), False), dtype=self_columns.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would dedup_names be necessay here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having unique column names would preserve the original logic of using column names. Switching to indices would require multiple indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh?
pandas/core/frame.py
Outdated
| other_columns = Index( | ||
| dedup_names(list(other_columns), False), dtype=other_columns.dtype | ||
| ) | ||
| this.columns = Index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why alter this.columns?
pandas/core/frame.py
Outdated
| result = {} | ||
| for col in new_columns: | ||
| for col in new_columns_unique: | ||
| series = this[col] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get rid of all the dedup_names stuff above and just iterate over range(this.shape[1]) and use series = this.iloc[:, i], other_series = other.iloc[:, i]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There has to be other fixes because the logic heavily relied on column names instead of indices. I think this, other and new_columns(which result uses) would each need to have their own indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic heavily relied on column names instead of indices
how so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the docstring of the function there is an example in which one dataframe has columns A,B while other has B,C. In that case it would be tricky to use index instead of column name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On L9100 we call self.align(other). After that, the columns are aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what align is doing now after seeing the docsting example with different column names.
|
@jbrockmendel , I was able to use .iloc instead of column names. |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.